This project considers the burdens of poor air quality and its health consequences, and how they fall on different American communities. To explore these relationships, I use daily site-level monitoring data for fine particulate matter (PM2.5) from the EPA’s Air Quality System for 2019, linked with tract-level data from the 2018 5-year American Community Survey for the census tract containing each testing site. In developing a plan for this project I met with Drs. Anil Vachani, Gary Weissman, and John P. Reilly, who provided insights on the causal pathway from pollution sources to specific EPA-measured pollutants to short- and long-term health outcomes; potential data sources and analytic approaches; the ways in which “natural experiments” around major changes in pollutant generation have been used to compare the effects of perturbation from a steady state; and more. Materials for this project can be found in this GitHub repository.
Very small particulate matter in the air can be inhaled deep into the lungs, causing respiratory and cardiovascular health problems such as asthma attacks, bronchitis, and heart attacks. Particulate matter is just one type of air pollution, but it and other pollutants such as carbon monoxide, nitrogen oxides, and sulfur dioxide are often generated by similar sources and are thus highly correlated with one another. Particulate matter is classified as either fine or coarse: fine particles (PM2.5) are less than 2.5 microns in diameter, while coarse particles are between 2.5 and 10 microns. Combustion is the major source of PM2.5, especially car exhaust, coal- or gas-fired power plants, and wood burning. A concentration of 12 micrograms per cubic meter (µg/m³) is considered the maximum acceptable level of PM2.5 in ambient air; this corresponds to the boundary between the “Good” (green) and “Moderate” (yellow) zones of the Air Quality Index (AQI) as updated in 2012.
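To make that threshold concrete, here is a minimal sketch (assuming only the post-2012 breakpoint described above) of how a daily reading maps onto these two AQI zones:
# Minimal sketch: classify daily PM2.5 readings (ug/m3) against the 12.0
# boundary between the "Good" and "Moderate" AQI zones
classify.pm2.5 <- function(x) ifelse(x > 12.0, "Moderate or worse", "Good")
classify.pm2.5(c(8.3, 12.0, 15.7)) # "Good" "Good" "Moderate or worse"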
Poor air quality can have a substantial negative influence on health and quality of life, but different households have different resources, incentives, and trade-offs to weigh in deciding where to live, especially since sources of pollution like factories and highways can also be essential connections to work, family, and friends. As a result, we might expect the relationship between place of residence and air quality to be more complex than a simple sorting of people with more wealth, income, and social power into places with better air quality. By considering many different community characteristics in building models to predict poor air quality, I aim to learn which factors are most strongly associated with air quality and how successfully models trained on community characteristics can predict it.
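As a preview of that modeling aim, here is a hedged sketch (not a final specification) of one candidate model, written with the tract-level outcome flag and ACS-based predictor names constructed later in this document:
# Sketch only: a logistic model of whether a tract exceeded the "Good" PM2.5
# threshold on more than 10% of valid days, using names created below
# fit.sketch <- glm(pm2.5_flag10_ungreen ~ income.hh.median.10k + race.black.any +
#                     renter + manufacturing + commutes.car.alone,
#                   data = AQI.ACS, family = binomial())
# summary(fit.sketch)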
After loading the required packages, I define a function to create a data frame from the results of a GET call to the EPA’s Air Quality System (AQS) API. I then create objects with relevant reference lists and define a vector of the “lower 48” U.S. states plus Washington, D.C. for later use.
# Load required packages
library(tidyverse) # dplyr, tidyr, tibble, stringr, ggplot2
library(httr)
library(jsonlite)
library(sf)
library(tigris)
library(tidycensus)
library(leaflet)
# Retrieve environment variables with EPA AQS API credentials
api.email <- Sys.getenv("EPA_AQS_EMAIL")
api.key <- Sys.getenv("EPA_AQS_KEY")
# Function to create a data frame from a GET call to an EPA AQS API endpoint
download.AQS <- function(whichData, customParams) {
  rootpath <- "https://aqs.epa.gov/data/api/"
  morepath <- paste0(rootpath, whichData)
  # Credentials accompany every request, ahead of any endpoint-specific parameters
  fullParams <- append(list(email = api.email, key = api.key), customParams)
  step1.json <- httr::GET(url = morepath, query = fullParams)
  step2.parsed <- jsonlite::fromJSON(httr::content(step1.json, "text"), simplifyVector = FALSE)
  # The records of interest live in the $Data element of the parsed response
  step3.df <- tibble(Data = step2.parsed$Data) %>% unnest_wider(Data)
  return(step3.df)
}
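# The same helper should generalize to other AQS Data API services, e.g. (usage
# sketch, assuming the list/parametersByClass endpoint and its pc parameter):
# criteria.pollutants <- download.AQS("list/parametersByClass", list(pc = "CRITERIA"))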
# Fetch documentation for daily data summaries, state code list
dailyData.dictionary <- download.AQS("metaData/fieldsByService",list(service = "dailyData"))
list.states <- download.AQS("list/states",list())
# Omitting state codes other than 48 states + D.C.
list.states.lower48 <- list.states %>% filter(!(code %in% c("02","15","66","72","78","80","CC")))
list.states.lower48
lower48.codes <- dplyr::pull(list.states.lower48, code)

To manage downloading daily PM2.5 data for each state for all of 2019, I define another function that loops through the state FIPS codes for a given date range. Executing this step with one call per month of data takes roughly 90 minutes, so I save the resulting data frame and load it in a later chunk.
# Pull daily PM2.5 (parameter 88101) data for these states, with arguments for date range
get.daily.lower48 <- function(b, e){
  collector <- data.frame()
  for (s in seq_along(lower48.codes)) {
    # One API call per state for the given date window
    newstate <- download.AQS("dailyData/byState",
                             list(param = "88101", bdate = b, edate = e, state = lower48.codes[s]))
    collector <- bind_rows(collector, newstate)
  }
  return(collector)
}
# Run in monthly batches
dailyPM2.5_201901 <- get.daily.lower48("20190101","20190131")
dailyPM2.5_201902 <- get.daily.lower48("20190201","20190228")
dailyPM2.5_201903 <- get.daily.lower48("20190301","20190331")
dailyPM2.5_201904 <- get.daily.lower48("20190401","20190430")
dailyPM2.5_201905 <- get.daily.lower48("20190501","20190531")
dailyPM2.5_201906 <- get.daily.lower48("20190601","20190630")
dailyPM2.5_201907 <- get.daily.lower48("20190701","20190731")
dailyPM2.5_201908 <- get.daily.lower48("20190801","20190831")
dailyPM2.5_201909 <- get.daily.lower48("20190901","20190930")
dailyPM2.5_201910 <- get.daily.lower48("20191001","20191031")
dailyPM2.5_201911 <- get.daily.lower48("20191101","20191130")
dailyPM2.5_201912 <- get.daily.lower48("20191201","20191231")
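# The twelve monthly windows above are written out explicitly for clarity; as an
# alternative sketch, the bdate/edate pairs could be generated programmatically:
# month.starts <- seq(as.Date("2019-01-01"), as.Date("2019-12-01"), by = "month")
# month.ends <- seq(as.Date("2019-02-01"), as.Date("2020-01-01"), by = "month") - 1
# monthly.batches <- Map(function(b, e) get.daily.lower48(format(b, "%Y%m%d"), format(e, "%Y%m%d")),
#                        month.starts, month.ends)
# all.2019.dailyPM2.5 <- bind_rows(monthly.batches)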
# Concatenate monthly batch files into a single file for 2019
all.2019.dailyPM2.5 <- bind_rows( dailyPM2.5_201901
,dailyPM2.5_201902
,dailyPM2.5_201903
,dailyPM2.5_201904
,dailyPM2.5_201905
,dailyPM2.5_201906
,dailyPM2.5_201907
,dailyPM2.5_201908
,dailyPM2.5_201909
,dailyPM2.5_201910
,dailyPM2.5_201911
,dailyPM2.5_201912)
# Save full 2019 raw file
saveRDS(all.2019.dailyPM2.5, file = "all2019dailyPM25.RDS")

I test combinations of testing-site location identifiers and add a single site_id variable to the 2019 PM2.5 data set by concatenating them. I also create a data frame with one observation per testing site.
# Load the full 2019 raw file
all.2019.dailyPM2.5 <- readRDS("all2019dailyPM25.RDS")
str(all.2019.dailyPM2.5)
# Create a file of all unique combinations of geographic indicators for testing sites
testing.sites <- all.2019.dailyPM2.5 %>%
dplyr::select(state_code, state,
county_code, county,
city, cbsa_code, cbsa,
site_number, local_site_name, site_address,
latitude, longitude) %>%
unique() %>%
arrange(state_code,county_code,site_number)
dim(testing.sites)
# 965 combinations
# Create a single unique ID column for site
testing.sites %>% dplyr::select(site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,county_code,site_number) %>% unique() %>% dim()
# Only 253 unique site numbers. 770 state-site combos.
# State + county + site number *is* unique (965 combos)
clean.2019.dailyPM2.5 <- all.2019.dailyPM2.5 %>%
  mutate(site_id = paste0(state_code, county_code, site_number))

Using the tigris::tracts function, I retrieve the geometries, GEOIDs, and other spatial information for census tracts and join each AQS testing site to the data for the tract within which it lies, based on the latitude and longitude from AQS. I loop through the state FIPS codes one at a time and stack the results into a single crosswalk file. This step is time-consuming, so I save the resulting file and load it in the next step.
# Prepare for spatial merge. Create sf file from testing site coordinates
testing.sites.sf.points <- st_as_sf(testing.sites, coords = c('longitude','latitude'), crs = 4269)
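# Note: EPSG:4269 (NAD83) matches the CRS of the Census TIGER/Line shapefiles
# returned by tigris, so the points can be joined to tract polygons without reprojection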
# Download census tract sf shapefiles, one state at a time.
# Perform spatial joins one state at a time to avoid having to build the national tract shapefile.
for (s in seq_along(lower48.codes)) {
  newstate <- lower48.codes[s]
  newtracts <- tracts(newstate)
  # Inner spatial join: keep only tracts that contain at least one testing site
  newjoin <- st_join(newtracts, testing.sites.sf.points, join = st_contains, left = FALSE)
  if (s == 1){
    testing.sites.tracts <- newjoin
  } else {
    testing.sites.tracts <- bind_rows(testing.sites.tracts, newjoin)
  }
}
# Save file with testing sites <> census tract
saveRDS(testing.sites.tracts, file = "testing_sites_tracts.RDS")

From tidycensus I now retrieve the variable list for the 2018 5-year American Community Survey. After reviewing the concepts from those variables as well as the questionnaire, I create an object with a set of variables to test for associations with the AQS data.
testing.sites.tracts <- readRDS("testing_sites_tracts.RDS")
head(testing.sites.tracts)
# Load a data frame with all 2018 5-year ACS variables
census.5yr.vars <- load_variables(dataset = "acs5", year = 2018)
head(census.5yr.vars)
# Simplify the variable list to make it easier to review the major concepts represented
census.5yr.concept.counts <-
  census.5yr.vars %>%
  mutate(root.concept = str_split_fixed(concept, " BY", 2)[,1]) %>%
  mutate(root.concept = str_split_fixed(root.concept, " \\(", 2)[,1]) %>%
  # Double escape needed to match the open parenthesis in the regex!
  mutate(root.name = substring(name, 1, 6)) %>%
  group_by(root.name, root.concept) %>%
  summarise(obs = n(), .groups = "keep") %>%
  arrange(root.name, -obs)
# View(sort(unique(census.5yr.concept.counts$root.concept)))
# After reviewing the variable list, create an object with the ones of interest
final.acs.var.list <- census.5yr.vars %>% filter(name %in% c(
"B01002_001" # median age
,"B03002_003","B03002_001" # Non-Hispanic White alone // Denom
,"B02009_001","B02001_001" # Black race, alone or in combination // Denom
,"B06008_003","B06008_001" # Now married // Denom
,"B23007_002","B23007_001" # Children under 18yrs in household // Denom
,"B15003_022","B15003_023","B15003_024","B15003_025","B15003_001" # Bachelors // Masters // Professional // Doctorate // Denom
,"B23022_027","B23022_026","B23022_003","B23022_002" # Worked in past 12 months-Female // Denom-Female // p12-Male // Denom-Male
,"B21001_002","B21001_001" # Veteran // Denom
,"B17001_002","B17001_001" # Past 12mo income at or below poverty level, Denom
,"B22010_002","B22010_001" # Received SNAP in past 12mo, Denom
,"B22008_001" # Median household income, past 12mo
,"B25008_003","B25008_001" # Renter occupied, Denom
,"B25024_002","B25024_003","B25024_001" # Single-fam, detached // single-fam, attached // Denom
,"B25064_001" # Median gross rent
,"B25088_002" # Median monthly owner costs for households with a mortgage
,"B08126_004","B08126_002","B08126_001" # INDUSTRY: Manufacturing // Agriculture, fishing, mining // Denom
,"B08006_003","B08006_001" # Drove to work alone // Denom
,"B08013_001" # Aggregate travel time to work in minutes
)) %>% arrange(name)
final.acs.var.list

Once more I loop through the state FIPS codes, retrieving the ACS variables of interest by tract as a flat file using tidycensus::get_acs and preserving the data only for the census tracts that contain AQS testing sites. As this is another slow step, I save the resulting data frame.
head(testing.sites.tracts)
# Keep only what is needed to make the ACS calls in a small data frame
tst.minimal <- testing.sites.tracts %>%
mutate(site_id = paste0(state_code,county_code,site_number)) %>%
dplyr::select(site_id, GEOID, STATEFP, COUNTYFP)
st_geometry(tst.minimal) <- NULL
# Loop through existing list of state codes
lower48.codes
# Make ACS calls one state at a time, only for counties with AQS testing sites
# Preserve estimate columns, ditch MOE
# Join to testing site list
# Stack ACS data together in a single data frame
acs.loop <- function(){
  for (st in seq_along(lower48.codes)) {
    newstate <- lower48.codes[st]
    # Counties in this state that contain at least one testing site
    county.list <- tst.minimal %>%
      filter(STATEFP == newstate) %>%
      distinct(COUNTYFP) %>%
      arrange(COUNTYFP) %>%
      pull(COUNTYFP)
    results.acs <- get_acs( geography = "tract"
                           ,variables = final.acs.var.list$name
                           ,state = newstate
                           ,county = county.list
                           ,output = "wide")
    # Keep GEOID and the estimate columns (suffix E); drop margins of error (suffix M)
    smaller.acs <- results.acs %>% dplyr::select(GEOID, matches(".*_.*E$"))
    linked.acs <- tst.minimal %>%
      filter(STATEFP == newstate) %>%
      left_join(., smaller.acs, by = "GEOID")
    if (st == 1){
      collect <- linked.acs
    } else {
      collect <- bind_rows(collect, linked.acs)
    }
  }
  return(collect)
}
testing.sites.acs <- acs.loop()
testing.sites.acs
saveRDS(testing.sites.acs, file = "testing_sites_acs.RDS")

With all the ACS data in hand, I can create clean versions of the variables to be included in the analysis. I exclude tracts that do not have complete data for all 19 ACS-based variables or that have fewer than 500 respondents in the sample (using the denominator from the set of race variables).
testing.sites.acs <- readRDS("testing_sites_acs.RDS")
head(testing.sites.acs)
# Create percentage variables and scaled medians/totals to use in modeling
testing.sites.acs.ready <- testing.sites.acs %>%
dplyr::rename( age.median = B01002_001E
,income.hh.median = B22008_001E
,rent.median = B25064_001E
,home.pmt.median = B25088_002E
,commute.time.agg = B08013_001E) %>%
mutate( race.black.any = 100*B02009_001E/B02001_001E
,ethn.nhisp.white.alone = 100*B03002_003E/B03002_001E
,married = 100*B06008_003E/B06008_001E
,kids = 100*B23007_002E/B23007_001E
,bachelor.plus = 100*(B15003_022E + B15003_023E + B15003_024E + B15003_025E)/B15003_001E
,veteran = 100*B21001_002E/B21001_001E
,manufacturing = 100*B08126_004E/B08126_001E
,farm.fish.mining = 100*B08126_002E/B08126_001E
,commutes.car.alone = 100*B08006_003E/B08006_001E
,renter = 100*B25008_003E/B25008_001E
,single.fam.home = 100*(B25024_002E + B25024_003E)/B25024_001E
,worked.12mo = 100*(B23022_027E + B23022_003E)/(B23022_026E + B23022_002E)
,poverty.12mo = 100*B17001_002E/B17001_001E
,snap.12mo = 100*B22010_002E/B22010_001E
,denominator = B02001_001E) %>%
dplyr::select(-matches("^B[012].*"))
head(testing.sites.acs.ready)
summary(testing.sites.acs.ready)
# testing.sites.acs.ready %>% dplyr::filter(denominator < 1000) %>% View()
# testing.sites.acs.ready %>% dplyr::filter(denominator < 500 | !complete.cases(.)) %>% View()
length(unique(testing.sites.acs.ready$GEOID))
# 953 unique GEOIDs, so the data are not unique by tract (some tracts contain multiple sites)
# 34 tracts to exclude based on incomplete data and total denominator
# Bring the aggregates and medians to a comparable scale
testing.sites.acs.complete <-
testing.sites.acs.ready %>%
mutate(commute.time.agg = commute.time.agg/10000
,income.hh.median = income.hh.median/10000
,rent.median = rent.median/100
,home.pmt.median = home.pmt.median/100
) %>%
dplyr::rename(commute.time.agg.10k = commute.time.agg
,income.hh.median.10k = income.hh.median
,rent.median.100 = rent.median
,home.pmt.median.100 = home.pmt.median) %>%
dplyr::filter(denominator >= 500 & complete.cases(.))
summary(testing.sites.acs.complete)
length(unique(testing.sites.acs.complete$GEOID))

I join the census tract GEOIDs to the daily PM2.5 data, and clean and summarize the PM2.5 data by tract. Tracts with fewer than 50 days of valid data are excluded.
summary(clean.2019.dailyPM2.5)
# Create a file of site IDs with GEOIDs that have complete ACS data.
site_id.GEOID.xwalk <- testing.sites.acs.complete %>%
dplyr::select(site_id,GEOID) %>%
unique()
GEOID.dailyPM2.5 <- inner_join( site_id.GEOID.xwalk
,clean.2019.dailyPM2.5
,by="site_id")
dim(GEOID.dailyPM2.5)
# 1,396,734 rows
dim(unique(GEOID.dailyPM2.5[,c("GEOID","date_local")]))
# 243,809 tract-days
table(GEOID.dailyPM2.5$event_type, useNA="always")
# Included, Excluded, None
table(GEOID.dailyPM2.5$validity_indicator, useNA="always")
table(GEOID.dailyPM2.5$validity_indicator, GEOID.dailyPM2.5$event_type, useNA="always")
# validity_indicator takes values Y and N; 4,282 rows are N
# Excluding validity_indicator == "N" does not remove a meaningful share of event rows
# GEOID.dailyPM2.5 %>% dplyr::filter(validity_indicator == "N") %>% View()
summary(GEOID.dailyPM2.5$arithmetic_mean)
# Create preliminary outcome variables
GEOID.daily.summary <-
  GEOID.dailyPM2.5 %>%
  dplyr::filter(validity_indicator=="Y") %>%
  # Floor physically impossible negative readings at zero
  mutate(arithmetic_mean = case_when(arithmetic_mean < 0 ~ 0, TRUE ~ arithmetic_mean)) %>%
  group_by(GEOID,date_local) %>%
  # Average across monitors within a tract to get one value per tract-day
  summarise(mean = mean(arithmetic_mean),.groups = "keep")
# table(GEOID.daily.summary$any_event)
summary(GEOID.daily.summary$mean)
GEOID.AQI.outcomes <-
  GEOID.daily.summary %>%
  # "Ungreen" = tract-day mean above the 12.0 ug/m3 Good/Moderate AQI boundary
  mutate(ungreen = case_when(mean > 12.0 ~ 1L, TRUE ~ 0L)) %>%
  group_by(GEOID) %>%
  summarise( pm2.5_days = n()
            ,pm2.5_p50 = quantile(mean, 0.50)
            ,pm2.5_ungreen_days = sum(ungreen)
            ,.groups="keep")
head(GEOID.AQI.outcomes)
summary(GEOID.AQI.outcomes)
# How many days of valid data do we have per tract?
table(GEOID.AQI.outcomes$pm2.5_days)
# Create final PM2.5 outcome variables
AQI.outcomes.final <-
  GEOID.AQI.outcomes %>%
  dplyr::filter(pm2.5_days >= 50) %>%
  mutate( pm2.5_pct_ungreen = 100 * pm2.5_ungreen_days / pm2.5_days
         # ungreen_days * 10 > days  <=>  ungreen on more than 10% of valid days
         ,pm2.5_flag10_ungreen = case_when(pm2.5_ungreen_days * 10 > pm2.5_days ~ 1L, TRUE ~ 0L)
         # ungreen_days * 5 > days  <=>  ungreen on more than 20% of valid days
         ,pm2.5_flag20_ungreen = case_when(pm2.5_ungreen_days * 5 > pm2.5_days ~ 1L, TRUE ~ 0L)
         ) %>%
mutate( pm2.5_flag10_ungreen = factor(pm2.5_flag10_ungreen
,levels = c(0,1)
,labels = c("0-10%",">10%"))
,pm2.5_flag20_ungreen = factor(pm2.5_flag20_ungreen
,levels = c(0,1)
,labels = c("0-20%",">20%"))) %>%
dplyr::select(GEOID, pm2.5_p50, pm2.5_pct_ungreen, pm2.5_flag10_ungreen, pm2.5_flag20_ungreen)
# Review created outcome variables
head(AQI.outcomes.final)
summary(AQI.outcomes.final$pm2.5_p50)
summary(AQI.outcomes.final$pm2.5_pct_ungreen)
table(AQI.outcomes.final$pm2.5_flag10_ungreen)
table(AQI.outcomes.final$pm2.5_flag20_ungreen)
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag10_ungreen==">10%") %>% summary()
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag20_ungreen==">20%") %>% summary()

With both predictors and outcomes summarized by census tract, the ACS and AQS files can now be joined to create the final data set.
str(AQI.outcomes.final)
str(testing.sites.acs.complete)
# Drop unnecessary geographic variables from the ACS file
acs.joinready <-
testing.sites.acs.complete %>%
dplyr::select(-site_id, -STATEFP, -COUNTYFP) %>%
unique()
dim(acs.joinready)
length(unique(acs.joinready$GEOID))
# 920 rows, unique on GEOID
# Join PM2.5 and ACS data by tract on GEOID
AQI.ACS <- inner_join(AQI.outcomes.final,acs.joinready,by="GEOID")
head(AQI.ACS)
# Prepare the geography/geometry file for linkage
str(testing.sites.tracts)
geom.joinready <-
testing.sites.tracts %>%
dplyr::select(GEOID,NAMELSAD,STATEFP,state,COUNTYFP,county,city,cbsa,cbsa_code,geometry) %>%
unique()
dim(geom.joinready)
length(unique(geom.joinready$GEOID))
geom.joinready %>% group_by(GEOID) %>% summarise(obs=n(),.groups="keep") %>% arrange(-obs) %>% head()
geom.joinready %>% dplyr::filter(GEOID=="06079012304")
# Hard-code one value to be able to keep city attribute unique by GEOID
geom.joinready2 <-
geom.joinready %>%
mutate(city = case_when(GEOID=="06079012304" ~ "Arroyo Grande", TRUE ~ city)) %>%
unique()
dim(geom.joinready2)
length(unique(geom.joinready2$GEOID))
head(geom.joinready2)
# Join geography/geometry file with AQS outcomes and ACS variables
AQI.ACS.geom <- inner_join(geom.joinready2, AQI.ACS, by = "GEOID") %>% ungroup()
head(AQI.ACS.geom)
class(AQI.ACS.geom)

The continuous PM2.5 outcomes show geographic variability, and some patterns are evident: very rural tracts appear to have lower levels of PM2.5 than urban centers, and the “Rust Belt” and southern California seem to have higher PM2.5 generally. Still, there do not appear to be large areas within which the analyzed tracts have very consistent values.
Continuous PM2.5 outcomes by tract, mapped:
AQI.ACS.geom.WGS84 <- st_transform(AQI.ACS.geom, 4326) # Transform from NAD83 (EPSG:4269) to WGS84 (EPSG:4326), which leaflet expects
palette <- colorNumeric("viridis", NULL, reverse=TRUE)
# Median PM2.5 by tract - interactive
popup_msg <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
,"<br>Median PM2.5: ",round(AQI.ACS.geom.WGS84$pm2.5_p50,digits=2))
leaflet(AQI.ACS.geom.WGS84) %>%
addPolygons(fillColor = ~palette(pm2.5_p50)
,fillOpacity = 0.9
,weight = 1.5
,color = "black"
,popup = popup_msg) %>%
addTiles() %>%
addLegend("bottomright",
pal=palette,
values=~pm2.5_p50,
title = 'PM2.5',
opacity = 1) %>%
addScaleBar()
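The interactive map above could also be rendered statically; a minimal sketch (assuming ggplot2, which is loaded with the tidyverse) using geom_sf:
# Static alternative sketch: median PM2.5 by tract with ggplot2::geom_sf
ggplot(AQI.ACS.geom.WGS84) +
  geom_sf(aes(fill = pm2.5_p50), color = "black", size = 0.1) +
  scale_fill_viridis_c(direction = -1, name = "Median PM2.5") +
  theme_minimal()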